Abstract: Machine learning, a subset of Artificial Intelligence, has gained much
recognition in facilitating disease prediction and the decision-making process in
healthcare. One of the most often diagnosed developmental disorders in the world
is Autism Spectrum Disorder (ASD). Around the world, it is reported to afflict 75
million people, and the number of cases has gradually increased since studies
began in the 1960s. The symptoms generally include communication deficits,
sensory processing differences, and repetitive actions or behaviors. This research
develops a model to detect ASD using Principal Component Analysis and
Machine Learning algorithms to classify and predict the risk of ASD among
pregnant women. Data was collected from National Hospital in Abuja, Nigeria.
According to the results, PCA and Random Forest produced the best accuracy of
98.7%. Bayesian probability was employed to evaluate and verify the reliability of
the model. The created model can aid doctors in diagnosing ASD.

Introduction

Autism spectrum disorder (ASD), a neuro-disorder, has a
long-term impact on an individual's capacity to engage
and interact with others. Since symptoms of autism
typically show in the initial phase of a child's life, it is
considered a "behavioral disease," although it can be
diagnosed at any point in life (Raj et al., 2020). One of
the most often diagnosed developmental disorders today
is ASD. Around the world, it is reported to afflict 75
million people, and the number of cases has gradually
increased since studies began in the 1960s. The symptoms
generally include communication deficits, sensory
processing differences, and repetitive actions or
behaviors. It also exists as a spectrum divided into 3
levels, each based on the severity of these symptoms
(Hugues et al., 2021). Scientists are focusing on strategies
to identify ASD as early as possible, as early therapeutic
intervention is crucial for children with autism. Today, a
reliable diagnosis can be made as early as age 2. No
simple diagnostic tool is available, such as X-rays for
fractured bones or blood testing for diabetes. Autism is
more difficult to diagnose because diagnosis can only be
based on behavior. Parents and doctors frequently miss
children with less severe symptoms, while more severe
cases may look like other developmental abnormalities
(Amin et al., 2023). Researchers at the University of
Limoges in France recently created a computer program
that uses fetal traits to predict whether a child will be
diagnosed with autism as early as one day after birth.
Early detection of autism would enable families to start
the communicative, social, and sensory therapy that may
be essential to the development of the autistic child
(Hugues et al., 2021).

Technology has been used in the health sector for
purposes such as monitoring, evaluation, and awareness.
Machine learning, a subset of artificial intelligence, is one
of the most effective tools for achieving these goals,
whether the objective is to predict outcomes or to use
patterns in existing data to uncover hidden patterns in
future data (Srivastava and Tripathi, 2023; Singh and
Sharma, 2023; Singh et al., 2023). ML methods are very
effective in disease recognition (Haloi et al., 2023;
Gajbhiye et al., 2023). These diseases may be related to
plants (Kumar et al., 2021; Rukhsar et al., 2022;
Upadhyay et al., 2021; Upadhyay et al., 2022) or animals.
However, a machine learning model's efficiency
frequently depends on the training data, in line with the
computing maxim "garbage in, garbage out," so the
method used to collect data frequently affects the model's
performance (Zhang et al., 2021). Inconsistent data,
defined as incorrect data entry, is the most common issue
in data collection. To address this, feature engineering
and selection techniques are used to manipulate data
when developing models. Feature selection is "reducing
the height of the data by finding a set of features that
describe the data well" (Khaire et al., 2019). Considering
these observations, feature selection is critical to ensuring
that only crucial features are utilized in the design,
reducing disparities in the data and optimizing for
well-defined prediction models.

Autism is a systemic disorder that affects the brain
and is not a genetic brain disorder. In people predisposed
to this disorder, certain genes are activated by a
hazardous environment. ASD is a neurological condition
that causes children to have a lifelong dysfunction that
leads to mental disease (Kundu, 2019). As a result,
precautions should be taken to diagnose the illness as
early as possible (Marin et al., 2019). In clinical and
scientific research, autism is still exceedingly
challenging to diagnose during pregnancy or right after
birth. Compared to statistical analysis, machine learning
classifiers have been shown to increase the accuracy of
health prediction (Sivaram, 2022). Stress has emerged as
an integral part of every person's life in today's
competitive world, affecting an individual directly or
indirectly in many ways (Mittal et al., 2022). The results
of predictive classification algorithms need to be verified
using real-world information. Interpretability for health
practitioners is a further concern. When interpretability is
good, it is simpler for the healthcare professionals who
must use such models to trust their conclusions. This
study aims to fill these knowledge gaps by building a
model to predict autism spectrum disorder utilizing PCA
and machine learning classification analysis.

Design Science (DS) research aims to generate 
prescriptive knowledge about the design of artifacts, such 
as software, methods, models or concepts. It includes six 
steps: problem identification and motivation, objectives, 
design and development, demonstration, evaluation, and 
communication. This paper explored the concept of DS
and the impact of feature engineering algorithms such as 
scikit-learn random samplers to generate synthetic data 
for predictive models where data is weak or inconsistent 
due to many outliers. In addition, a comparative analysis 
of the literature has been done to identify the most 
effective model for predicting ASD (Viloria et al., 2020). 
The models identified in literature with high performance 
will be used on the dataset, using the same model 
evaluation metrics in the literature to draw a valid 
conclusion. 

The aim of this study is to build a machine-learning
prediction model for autism spectrum disorders. The
objectives of this study are as follows:

i. Carry out an Exploratory Data Analysis (EDA) on
the dataset.

ii. Build a model for prediction using PCA and
machine learning classification algorithms for ASD
on baseline data.

iii. Validate the model to improve performance
accuracy using a k-fold cross-validation technique.

iv. Employ Bayesian probability to assess the model's
reliability.

v. Compare the best-performing model with a similar
existing model in the literature.

This paper is organized as follows: Section 2
presents a literature review, Section 3 presents the 
methodology used in the research, Section 4 presents 
experimental results and finally, Section 5 concludes the 
work. 

Literature Review 

This section looks at the various definitions of key 
concepts and works that show how other researchers have 
predicted autism spectrum disorder using machine 
learning techniques. 

Autism Spectrum Disorder (ASD) 

ASD is a neuro-disorder that has a long-term impact on
an individual's capacity to engage and interact with
others. Autism is considered a "behavioral disease"
because symptoms typically develop in the initial phase
of a child's life, although it can be diagnosed at any point
in one's life. The condition begins in childhood and
persists into adolescence and adulthood (Raj et al., 2020).
Caly et al. (2021) examined retrospective ultrasound and
biological measurements, regularly gathered during
pregnancy and birth, of infants later diagnosed with ASD
or as neurotypical (NT), in order to identify infants at risk
of developing ASD and to detect ASD biomarkers early
after birth. Vakadkar et al. (2021) created an automated
ASD prediction model using minimal behavioural sets
taken from each diagnosis dataset. On the Q-CHAT-10
dataset, the developed model predicts autism spectrum
disorder with 93.84%, 81.52%, 94.79%, 97.15%, and
90.52% accuracy for Support Vector Machines, Random
Forest Classifier, Naive Bayes, Logistic Regression, and
K Nearest Neighbour, respectively. Of the five
supervised machine learning algorithms compared,
Logistic Regression and Naive Bayes had the best
accuracy in the detection of autism spectrum disorder.

ASD is a disorder that affects one in a hundred
children worldwide. Prevalence studies have considered
the effects of socioeconomic, ethnic, and regional
characteristics on prevalence estimates. Estimates of
prevalence changed over time and were very variable
both within and between sociodemographic groupings.
Researchers have made significant progress in the
development of health information systems, especially
standardized clinical coding systems (Mahdi et al., 2023).
Prevalence estimates of autism are crucial for guiding
public policy, increasing awareness, and establishing
research goals. These results reflect modifications to the
criteria of autism as well as variations in the methods and
environments used in prevalence studies (Zeidan et al.,
2022). Patients with autism deal with a variety of
difficulties, including attention problems, learning
disabilities, mental health issues such as anxiety and
depression, motor difficulties, sensory issues, and many
others. Autism is currently rising rapidly throughout the
world. The World Health Organization (WHO, 2017)
estimates that one in 160 infants has ASD. While some
people with the disorder can live independently, others
need care and support for the rest of their lives. Autism
diagnosis takes a long time and costs a lot of money.
Early diagnosis of autism is crucial for providing patients
with the right medicine at the right time. It could stop the
patient's illness from becoming critical and could lower
the long-term expenses brought on by a delayed diagnosis
(Omar et al., 2019).
Machine Learning

Machine Learning classifiers have been shown to improve
health prediction accuracy over statistical analysis
(Sivaram, 2022; Upadhyay et al., 2024). One of the
foundational features of machine learning systems is that
they learn to train themselves for different circumstances,
reducing the need for manual human involvement as
much as possible. Model development has a training
phase and a testing phase. Learning occurs during
training, when the system is given a dataset and an
algorithm. ML methods automatically identify important
predictive features and predict risks from early pregnancy
(Liu et al., 2022). Currently, electronic health records
(EHRs) offer detailed evidence about demography,
patient's medical history, medications, laboratory test
results, and doctor's diagnosis (Odu et al., 2022), which
serves as a dataset in most healthcare research. A dataset
is a collection of complete, usually massive, data used to
train the system. Datasets are an important part of every
machine learning system, and using ML techniques may
help enhance the prediction model (Shtar et al., 2022).
The dataset must be broad and large enough to allow the
system to learn to work under various conditions.
Another key aspect is the algorithm, which outlines how
the system is expected to understand the presented
dataset. Popular, basic algorithms like Regression,
Random Forests, Decision Trees, and so on are utilized in
various applications. During the testing phase, the
now-trained system is tested using various data records,
and the accuracy of the predictions is used to train the
system further, making it a continuous and typically
iterative process (Jabi et al., 2021).
Artificial Intelligence (AI) has focused on developing
machines capable of performing human tasks. In order to
achieve human performance capabilities, these machines
can learn and extract insights from the data (Ronmi et al.,
2023). The study of methods that computer systems use
to complete categorization tasks without explicit
instructions, mostly utilizing learned models, is known as
machine learning (Sufriyana et al., 2020). Supervised
learning and unsupervised learning are the two main
categories into which classification techniques may be
separated. The supervised approach uses the
characteristics of the already available data as input to
produce desired results, and the algorithm generates an
inferred function known as a classifier or regression
function. This approach provides quick calculation,
minimal false alarm rates, and great accuracy.
Unsupervised learning is a technique that produces
results based on the properties of the current data input
(Villalain et al., 2022). Fuzzy logic is a superset of
Boolean logic that has been extended to handle the idea
of partial truth.

Int. J. Exp. Res. Rev., Vol. 39: 213-228 (2024)

Random Forest 

Random Forest follows a divide-and-conquer strategy. It
generates many Decision Trees, each learned from a
randomly selected subset of the predictor attributes. How
far each tree can grow is limited by the attributes in its
subset. The final prediction for the test dataset is then
obtained by combining the trees' outputs using the
average or weighted average method. A random forest
dissimilarity measure can be used to quantify
unstructured data. The data shown are the unaltered,
original data collected from a variety of references. Since
it can successfully handle mixed variable types and is
invariant to repeated input variable changes, dissimilarity
calculated using a random forest is favorable (Schonlau et
al., 2020).
Decision Tree 

The decision tree algorithm belongs to the family of
supervised learning algorithms. Unlike many other
supervised learning algorithms, the decision tree
approach can be used to solve both classification and
regression problems (Franchuk et al., 2021).
Naive Bayes Algorithm 

Naive Bayesian Classification is both a supervised
learning technique and a statistical classification
technique. By calculating the probabilities of the
outcomes, the naive Bayes algorithm establishes an
underlying probabilistic model and captures uncertainty
about the model in a principled way. It may also be
applied to answer diagnostic and predictive questions
(Ghandi, 2018).
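
The probability calculation described above can be illustrated with a minimal, dependency-free sketch of a categorical naive Bayes classifier. It is not the implementation used in this study: the function names, the Laplace smoothing constant `alpha`, and the six toy records are hypothetical and chosen only to show how class priors and per-feature likelihoods combine through Bayes' theorem.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
    return class_counts, value_counts


def predict_proba(row, class_counts, value_counts, n_values=2, alpha=1.0):
    """Return P(class | row) via Bayes' theorem with Laplace smoothing."""
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        log_p = math.log(count / total)  # log prior P(class)
        for j, value in enumerate(row):
            seen = value_counts[(j, label)][value]
            # Smoothed likelihood P(feature_j = value | class)
            log_p += math.log((seen + alpha) / (count + alpha * n_values))
        log_scores[label] = log_p
    # Normalize so that the posteriors sum to 1.
    max_log = max(log_scores.values())
    exp_scores = {k: math.exp(v - max_log) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}


# Hypothetical toy records: (preterm_birth, preeclampsia) -> class
rows = [("YES", "YES"), ("YES", "YES"), ("YES", "NO"),
        ("NO", "NO"), ("NO", "NO"), ("NO", "YES")]
labels = ["YES", "YES", "YES", "NO", "NO", "NO"]

class_counts, value_counts = train_naive_bayes(rows, labels)
posterior = predict_proba(("YES", "YES"), class_counts, value_counts)
```

The "naive" part is the assumption that features are conditionally independent given the class, which is why the per-feature likelihoods are simply multiplied (summed in log space).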
Research Gap and Our Contribution 

Among the examined articles, we found a gap in the
work of Vakadkar et al. (2021), who developed a
prediction model to detect autism behavioral traits. On
the Q-CHAT-10 dataset, the developed model predicts
autism spectrum disorder with 93.84%, 81.52%, 94.79%,
97.15%, and 90.52% accuracy for Support Vector
Machines, Random Forest Classifier, Naive Bayes,
Logistic Regression, and K Nearest Neighbour,
respectively. Of the five supervised machine learning
algorithms compared, the LR and NB algorithms were the
most effective for detecting autism spectrum disorder.
This study uses principal component analysis (PCA)
for feature selection and a k-fold cross-validation
technique to optimize the result (Vakadkar et al., 2021).
The accuracy of the model was 98.7%. The Bayes
theorem was also used to test the reliability of the
findings, and the model was evaluated using the
Confusion Matrix. Furthermore, the performance
accuracy for the real dataset gave better and more reliable
results.


Method And Materials 

This section presents the methodology adopted for
this research. The subheadings below follow the process
used to develop the model and achieve the objectives of
the study.
Research Method 

The study employed a design science approach along
with a quantitative experimental design. Abuja, the
capital of Nigeria, is the only geographical area covered.
The aim of this research is to predict the risk of ASD
during pregnancy among pregnant women at the National
Hospital, Abuja. Figure 1 shows the proposed framework
(PPRF), which illustrates the method used to accomplish
the objective of this research. The study begins with data
collection, exploratory data analysis, and preprocessing,
which comprises data cleansing and handling of missing
values. Principal Component Analysis (PCA) was then
used for feature selection and k-fold cross-validation for
evaluation and validation. Following that, three
algorithms (Decision Tree, Random Forest, and Naive
Bayes) were used throughout the model development
phase. The characteristics of the models were taken into
consideration when conducting the analysis. Results from
the three classification models were compared. The
classification models were evaluated using the confusion
matrix, the Receiver Operating Characteristic (ROC)
curve, and the Area Under the Curve (AUC). The best
accuracy (98.7%) was achieved by the Random Forest
algorithm. The model's reliability was assessed using
Bayes' Theorem. In the last stage of the experiment, the
result was compared to findings from similar existing
models. The Python programming language was used to
implement the models.
Dataset and Data Collection 

The National Hospital, Abuja repository served as the
source for the dataset. Interviews between the patients
and the doctor served as the data collection method.
There are 998 samples.

Figure 1. Proposed Framework (PPRF).


Table 1. Contents of Dataset.

Attributes              Values    Numbers
Class                   YES       499
                        NO        499
Preterm Birth           YES       692
                        NO        306
Primipara               YES       377
                        NO        621
Family History          YES       254
                        NO        744
Personal History        YES       274
                        NO        724
Gestational diabetes    YES       655
                        NO        343
Preeclampsia            YES       691
                        NO        307
Deficiency in vit D     YES       363
                        NO        635
Age                     Integer values (in ranges)
BMI                     Integer values (in ranges)


Table 2. First 11 rows of the Autism Dataset.

[Table body not recoverable; surviving column headers:
Gestational diabetes, Preeclampsia, Personal History,
Deficiency in vit D, Primipara.]
Table 3. Attributes and Values of the Dataset.

Attributes              Category  Values
Class                   YES       499
                        NO        499
Preterm Birth           YES       692
                        NO        306
Primipara               YES       377
                        NO        621
Family History          YES       254
                        NO        744
Personal History        YES       274
                        NO        724
Gestational diabetes    YES       655
                        NO        343
Preeclampsia            YES       691
                        NO        307
Deficiency in vit D     YES       363
                        NO        635

Table 1 displays the numbers and values of the
dataset. This dataset has ten (10) attributes. The class
column distinguishes the problem (Autism, YES;
Non-Autism, NO). Based on domain knowledge, there are
nine (9) independent variables (Preterm Birth, Primipara,
Family History, Personal History, Gestational diabetes,
Preeclampsia, Deficiency in vit D, Age, and BMI (KG))
and one (1) dependent variable (Class).

Table 2 displays the dataset's first 11 records along
with an overview of the data. Nine of the ten columns are
attributes. The last column is the class column.

Table 3 displays the categorical attributes and their
values, including the class variable, which has 499 Yes
values to indicate that 499 people have autism and 499
No values to indicate that 499 people do not.

Pre-processing 

In the pre-processing stage, values or features that
could have a negative effect on the model are carefully
cleaned up or eliminated. Pre-processing sought to
change the data into a format that could be used for
model fitting and further analysis. During the
pre-processing stage, biases in the data were removed,
and randomization techniques and imputation of missing
values were used. The techniques employed in this
research to prepare the data include data integration, data
dimension reduction, standardization, and management
of missing and noisy data. Noisy data were smoothed
before being used to forecast data that had not yet been
observed, and outliers were found and eliminated from
the dataset. The idea of exploratory data analysis was
developed to explore, visualize, analyze, process, and
interpret data variations and relations.

A correlation plot was utilized to show how the
features were correlated. The model treats each feature
column as a vector of values, and the relationship
between two such vectors can be measured. One measure
is the covariance of the feature vectors, which is obtained
by taking the dot product of the two mean-centered
feature vectors and dividing by the number of
observations. The sign denotes the direction in which one
feature changes with the other, either increasing or
decreasing. Equation (1) provides the formula:

Cov(x, y) = Sum((x - mean(x)) (y - mean(y))) / n
(Musa et al., 2024)                                  (1)

where x and y are the two features and n is the number of
observations.

Correlation, which is the cosine of the angle between
the two mean-centered vectors, was used rather than
covariance, because covariance only provides the
direction of the variation. Correlation values range from
-1 to +1. 'std' stands for Standard Deviation in the
formula. The formula is given by Equation (2):

Correlation = corr(x, y) = Cov(x, y) / (std(x) . std(y))
(Musa et al., 2024)                                  (2)
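
Equations (1) and (2) can be checked numerically with a short sketch. The code below uses only the Python standard library; the two feature vectors are hypothetical values, not records from the study's dataset, and the population form (division by n) is used to match Equation (1).

```python
import math


def covariance(x, y):
    """Population covariance, Equation (1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def correlation(x, y):
    """Pearson correlation, Equation (2): Cov(x, y) / (std(x) * std(y))."""
    std_x = math.sqrt(covariance(x, x))  # std is the square root of Cov(x, x)
    std_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)


# Hypothetical values for two numeric features.
bmi = [22.0, 27.0, 31.0, 35.0, 40.0]
age = [21.0, 25.0, 30.0, 34.0, 41.0]

# Rescaling one feature changes the covariance but not the correlation.
scaled_bmi = [10 * v for v in bmi]
cov_original = covariance(bmi, age)
cov_scaled = covariance(scaled_bmi, age)
corr_original = correlation(bmi, age)
corr_scaled = correlation(scaled_bmi, age)
```

Multiplying a feature by 10 multiplies the covariance by 10 while leaving the correlation unchanged, which is the practical reason for preferring correlation when features are measured on different scales.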

Correlation is not affected by scale or units. It is
crucial to understand the correlations between the
features so that only one of a pair of very highly
correlated features is used, since such features frequently
carry the same information. Additionally, features that
have only a weak association with the target values
should be avoided, because they have minimal impact on
a prediction but add to the model's complexity. In the
plot, the size and the hue of each cell signify how the two
features are related to one another. The closer the value is
to 1, the stronger the correlation. Preterm birth and
Preeclampsia have a high correlation, followed by
Gestational diabetes and body mass index (BMI) with a
low correlation. Additionally, the plot indicated that
complex modeling methods might not be necessary. The
Python libraries Seaborn and Matplotlib were used to
create the correlation plot for this investigation.
Exploratory Data Analysis 

The idea of exploratory data analysis was developed
to explore, visualize, analyze, process, and interpret data
variations and relations. At this point, anomalies were
found. Exploratory data analysis aids in displaying
potential predictive models for the dataset presented.
Histograms are frequently plotted for categorical
variables, showing category counts, and for numerical
variables, reflecting the distribution.

When the categorical (binary) dependent variable was
used to compare various numeric variables, the box and
whisker plots and a few of the density plots overlapped.
The variable density charts discriminated between ASD
and NON-ASD. In Figure 2 below, the box plots make
the outliers easier to see. The solid box lies between the
first and third quartiles. The isolated points are the
outliers, while the whiskers mark the highest and lowest
points of the bulk of the distribution. The outliers must be
eliminated since they make the model less accurate. The
two boxes display the continuous data's variation with
respect to the categories.

Box plots are a graphical representation of the
numerical data through their quartiles. The lower and
upper whiskers act as boundaries of the data
distribution.
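
How a box plot separates outliers from the rest of the distribution can be sketched as follows. The source does not state which whisker rule was applied, so the conventional 1.5 x IQR fences are assumed here; the function name and the BMI values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Quartiles as cut points of the sorted data (Python 3.8+).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical BMI readings with one extreme value.
bmi = [21, 23, 24, 25, 26, 27, 28, 29, 30, 58]
lower, upper, outliers = iqr_outliers(bmi)

# Records inside the fences are kept for modeling.
kept = [v for v in bmi if lower <= v <= upper]
```

Points beyond the fences correspond to the isolated points drawn outside the whiskers; whether to drop, cap, or keep them remains a modeling decision.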
Figure 2. Box and Whisker Plot of the dataset.
Figure 3a. Distribution and Statistical Summary of the Dataset.
Figure 3b. Distribution and Statistical Summary of the Dataset.
Figure 3c. Distribution and Statistical Summary of the Dataset.


Minor overlap is shown in the graphical distribution
of the attributes in Figures 3a, 3b, and 3c, but there are
different distributions for each class of values for each of
the attributes. This is a promising indication that the
classes can be distinguished. Despite this, only a small
number of attributes have a Gaussian-like distribution.
This suggests that the distribution may become Gaussian
when more data are collected. Some characteristics, such
as BMI, may show a skew, with a significant number of
occurrences towards the top right end of the distribution.
Also, the attribute Age does not improve the model's
predictive value. The figures within the lines show each
group's absolute frequencies throughout the entire data
collection. Several inferences can be made from the
graph.

Feature Selection using Principal Component Analysis (PCA)

Feature reduction is the process of reducing the input
parameters to a model by using only the relevant data and
eliminating noise. Data preparation for modeling was
done. A significant part of the data preparation involved
transforming the dataset to include rescaled attribute
values and attributes broken down into components. The
process of feature selection was also employed to prepare
the data. Feature selection was carried out to deal with
redundant features in the data that can reduce the models'
accuracy.

The set of information used to train predictive models
significantly impacts how well they function; as a result,
feature selection speeds up the training process while
addressing issues like overfitting and misleading data.
The attributes in the data that are most important for
prediction were chosen using principal component
analysis, or PCA.

Machine learning typically relies on differences in
how data points behave across various classes. Low
variance implies that points will tend to cluster around
one another, making it difficult to distinguish between
them.

Figure 4. Feature Importance from PCA Analysis (panels: Before PCA, After PCA).

Figure 4 shows the result of PCA and the important
features. Preterm birth, Preeclampsia, Gestational
diabetes, and BMI are correlated with ASD and have a
higher importance in the matrix. The features of the ASD
data have different ranges: some vary between 0 and 1,
while others have values between 10 and 100. Min-Max
Scaling, which rescales values to the range 0 to 1, was
therefore used in the study. The variation was greatest in
the PCA space along PC1, which accounts for 23% of the
variance, and PC2, which accounts for 15% of the
variance. Together, they account for 38%. The
maximum-variance property can also be verified by
calculating the covariance matrix of the reduced space.
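
The scaling and projection steps can be sketched with NumPy. This is an illustration under stated assumptions, not the study's code: the data are randomly generated stand-ins (three binary risk factors plus age and BMI on wider scales), the helper names are hypothetical, and PCA is computed directly from the eigendecomposition of the covariance matrix rather than through a particular library class.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column to the range [0, 1]."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # guard against constant columns
    return (X - col_min) / col_range


def pca(X, n_components=2):
    """Project X onto its top components and return explained-variance ratios."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]            # sort descending
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    ratios = eigenvalues / eigenvalues.sum()
    projected = centered @ eigenvectors[:, :n_components]
    return projected, ratios


rng = np.random.default_rng(0)
# Hypothetical stand-in data, not the hospital dataset.
X = np.column_stack([
    rng.integers(0, 2, size=(200, 3)),   # three YES/NO risk factors
    rng.integers(18, 46, size=200),      # age
    rng.integers(18, 41, size=200),      # BMI
]).astype(float)

X_scaled = min_max_scale(X)
projected, ratios = pca(X_scaled, n_components=2)

# Covariance of the reduced space: diagonal, with the largest variance on PC1.
reduced_cov = np.cov(projected, rowvar=False)
```

In the reduced space the off-diagonal covariance is numerically zero and the diagonal entries are the two largest eigenvalues, which is one way to check that PC1 and PC2 capture the directions of greatest variance.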

Algorithm Training and Model Development 

The study used three machine learning classifiers:
Decision Tree, Random Forest, and Naive Bayes. K-fold
cross-validation was employed, with each fold's sample
randomly selected without replacement. Specifically, two
approaches were considered.

i. Model evaluation of the baseline data (original data)
ii. Model evaluation of the data with feature
importance

The best-performing model on the validation set was
then evaluated on the test set to assess how well it would
generalize. To increase accuracy, the model was
retrained with new input parameters for the selected
attributes. Model performance was further improved
using k-fold cross-validation, which gives a more reliable
estimate.
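
Sampling folds without replacement means that every record is used for validation exactly once. The following standard-library sketch shows one way to build such folds; the number of folds is not reported in the text, so k = 10 is an assumption, and the function names are hypothetical. Only the sample count of 998 comes from the dataset description.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices once, then cut them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_validation_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 998 samples as in the dataset; k = 10 is an assumed value.
splits = list(train_validation_splits(998, k=10))
```

A classifier would be fitted on each `train` index list and scored on the matching `validation` list, with the k scores averaged to give the cross-validated estimate.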

Evaluation of Model Performance 

The confusion matrix, ROC-AUC, and Bayesian test 
are the parameters used to assess and test the model's 
performance and reliability. 

Confusion Matrix 

The confusion matrix displays the number of correct
and incorrect predictions made by the classification
model in relation to the actual results of the data (the
target value). The matrix is N x N, where N is the number
of target values (classes). It is common practice to
evaluate the efficacy of such models using the
information in the matrix. In this study, the confusion
matrix was used to assess the effectiveness of the model
with binary target values (0 or 1). Each value of the
confusion matrix was established using the results of the
model testing.

True positives (TP): Number of correctly classified
tuples from the positive class.
False negatives (FN): Number of incorrectly classified
tuples from the positive class.
False positives (FP): Number of incorrectly classified
tuples from the negative class.
True negatives (TN): Number of correctly classified
tuples from the negative class.
Accuracy: Proportion of correctly classified tuples
(Musa et al., 2024).
TPR = (No. of correctly predicted ASD cases * 100) /
Total number of ASD cases
TNR = (No. of correctly predicted non-ASD cases * 100) /
Total number of non-ASD cases

The False Negative Rate accounts for the percentage of
the positive class incorrectly classified by the classifier.
The formula is given by Equation (3):

FNR = FN / (TP + FN)                                 (3)

The False Positive Rate accounts for the percentage of
the negative class incorrectly classified by the classifier.
The formula is given by Equation (4):

FPR = FP / (TN + FP) = 1 - Specificity               (4)

F-Score: This work employed the F1-score as the
main measure for assessing models, which is described as
the harmonic mean of precision and recall (Han et al.,
2012). The F1 score depends on which class is treated as
positive, since precision and recall are defined with
respect to the positive class. High precision is important
for prognosis since false positives are expensive (Musa et
al., 2024).

F1 Score: it is the combination of precision and
recall, represented by (2 * Precision * Recall) /
(Precision + Recall) (Musa et al., 2024). The formula is
given by Equation (5):

F-score = 2 * ((precision * recall) / (precision + recall))   (5)

Precision: the number of positive cases correctly
identified divided by all the cases identified as positive
(including false positives) (Musa et al., 2024). The
formula is given by Equation (6):

Precision = TP / (TP + FP) (Hossain et al., 2021)    (6)
Recall: Recall, also called the true positive rate
(TPR), is the proportion of positive tuples that were
correctly classified. Recall is also crucial, since finding a
sufficient share of the cases at risk is what the problem is
all about. The formula is given by Equation (7):

Recall = TP / (TP + FN) (Hossain et al., 2021)       (7)

False positive rate (FPR): the proportion of negative
tuples that were incorrectly classified.
False Positive Rate (FPR) = FP / (FP + TN) (Musa et al., 2024)

Accuracy: represents the number of correct
predictions divided by the total number of predictions. In
categorization tasks, accuracy is the percentage of correct
predictions the model makes out of all predictions.
Accuracy is a useful metric when the target variable
classes are well balanced (Musa et al., 2024). The
formula is given by Equation (8):

Accuracy = (TP + TN) / (TP + FN + FP + TN)           (8)

Specificity: the same as recall, but for negative
examples: it is the number of true negatives divided by all
the real negative cases (including false positives) (Musa et
al., 2024). The formula is given by Equation (9):

Specificity = TN / (TN + FP)                         (9)

Sensitivity: is the division of the true positives by all 

the real positive cases (including false-negative cases). 
When striving to have more true positives than true 
negatives, it counts. The model's sensitivity measures its 
capacity to detect preeclampsia. The percentage of 
positive cases predicted to be positively served as the 
sensitivity measure. Sensitivity describes the percentage 
of the positive class correctly classified and inferred from 
the confusion matrix (Musa et al., 2024). The formula is 
given by Equation (10) below: 

Sensitivity = TP / (TP + FN) 

(10) 

Error rate = 1 - Accuracy (11)
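The measures in Equations (3) to (11) all derive from the four cells of a confusion matrix. The following is a minimal sketch of that derivation, not the study's implementation; the `ConfusionMatrix` class, the `evaluate` function and the counts used are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # positives predicted as positive
    fn: int  # positives predicted as negative
    fp: int  # negatives predicted as positive
    tn: int  # negatives predicted as negative


def _ratio(numerator: float, denominator: float) -> float:
    # Assumed convention: an undefined ratio is reported as 0.
    return numerator / denominator if denominator else 0.0


def evaluate(cm: ConfusionMatrix) -> dict:
    total = cm.tp + cm.fn + cm.fp + cm.tn
    sensitivity = _ratio(cm.tp, cm.tp + cm.fn)   # Eq. (7) and (10)
    specificity = _ratio(cm.tn, cm.tn + cm.fp)   # Eq. (9)
    precision = _ratio(cm.tp, cm.tp + cm.fp)     # Eq. (6)
    accuracy = _ratio(cm.tp + cm.tn, total)      # Eq. (8)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "fnr": _ratio(cm.fn, cm.tp + cm.fn),     # Eq. (3)
        "fpr": _ratio(cm.fp, cm.tn + cm.fp),     # Eq. (4)
        "f_score": _ratio(2 * precision * sensitivity,
                          precision + sensitivity),  # Eq. (5)
        "error_rate": 1 - accuracy,              # Eq. (11)
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = evaluate(ConfusionMatrix(tp=90, fn=10, fp=20, tn=80))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping every measure in one function makes the complementary pairs easy to check: FNR equals 1 - sensitivity and FPR equals 1 - specificity.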

Receiver Operating Characteristic (ROC) and Area under the Curve (AUC)

The Receiver Operating Characteristic (ROC) is a
curve used to evaluate binary classification problems. It
visualizes the performance of a binary classifier
and shows the trade-off between the true positive rate
and the false positive rate. A ROC area of 1 represents a
perfect test and 0.5 represents a worthless test. It is an
excellent method for measuring the performance of a
classification model. The True Positive Rate (TPR) is
plotted against the False Positive Rate (FPR) for the
probabilities of the classifier predictions (Musa et al.,
2024). Then, the area under the plot is calculated.


Table 4. Result of Classifiers. (Columns: Models, Time taken (s), Accuracy (%), Precision, F-score, Sensitivity, Specificity, ROC/AUC; rows include the Proposed Model (PPRF).)


True Positive Rate = True Positive / (True Positive +
False Negative)

False Positive Rate = False Positive / (True Negative +
False Positive) (Musa et al., 2024).

The Area under the Curve (AUC) was interpreted as a
measure of the capacity of a classifier to differentiate
between classes and is also used as
a summary of the ROC curve. A strong model has an
AUC close to 1, indicating a high level of
separability. An AUC close to 0 indicates the
poorest measure of separability and a bad model: it
means the outcome is being reversed, with the model
predicting 0s as 1s and 1s as 0s. When AUC = 1, the
classifier can perfectly distinguish
between all the positive and negative class points.
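The construction described above (sweep the decision threshold, plot TPR against FPR, then take the area) can be sketched in plain Python. This is an illustrative sketch under the assumption that labels are coded 0/1 and that the area is taken with the trapezoidal rule; the function names and the toy scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to draw a ROC curve")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample sharing this score so ties form one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 0]
perfect = auc(roc_points(labels, [0.9, 0.8, 0.3, 0.1]))      # positives ranked first
reversed_ = auc(roc_points(labels, [0.1, 0.3, 0.8, 0.9]))    # ranking inverted
uninformative = auc(roc_points(labels, [0.5, 0.5, 0.5, 0.5]))  # all scores tied

print(perfect, reversed_, uninformative)
```

The three toy cases reproduce the interpretation in the text: 1 for perfect separation, 0 for a reversed ranking, and 0.5 for scores that carry no information.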
Bayesian Test 

To reassess the model's performance with known
sensitivity, the study ran a Bayesian test on the model of
choice, applying Bayes' theorem. The formula for
Bayes' theorem is given in Equation (12) below:

P(A|B) = P(B|A) P(A) / P(B) (Musa et al., 2024). (12)

Comparison of Findings with Existing Model 

The benchmark study (Vakadkar et al., 2021), which
was used as a comparison, had an accuracy of 97.15%.
The Q-CHAT-10 dataset (1054 records) with 18
attributes, collected from people with and
without autistic symptoms, was used to evaluate the
study model.


Analysis 

The data was analyzed to cover the stages of the 
machine learning model design, and the results were used 
to achieve the study objective of predicting autism 
spectrum disorder using the PCA and ML algorithms. 


Model Results 
Table 4 presents the results of the three
algorithms used in the study and, using the ROC curve
and confusion matrix parameters, evaluates the
classification models' performance. The characteristics
of the models were taken into consideration when
conducting the research. The ROC curve gives a visual
assessment of how well the classification model performs.
A threshold level can be chosen using the curve that
strikes a balance between
sensitivity and specificity. A flawless classifier will have
a ROC AUC of 1, while an entirely random classifier will
have a value of 0.5.
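The text does not state how a balancing threshold was chosen. One common rule, assumed here purely for illustration, is Youden's J statistic (sensitivity + specificity - 1), maximised over the candidate cut-offs. The function name and the toy data below are hypothetical.

```python
def best_threshold(labels, scores):
    """Pick the cut-off that maximises Youden's J = sensitivity + specificity - 1.

    A sample is predicted positive when its score is >= the cut-off.
    Returns (threshold, J, sensitivity, specificity).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to choose a threshold")

    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        # Strict comparison keeps the lowest threshold among equal J values.
        if best is None or j > best[1]:
            best = (threshold, j, sensitivity, specificity)
    return best


# Toy scores (not study data): one negative is scored above one positive.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
threshold, j, sensitivity, specificity = best_threshold(labels, scores)
print(threshold, j, sensitivity, specificity)
```

Other criteria are equally defensible, for example fixing a minimum sensitivity when missed cases are costlier than false alarms; the choice depends on the clinical setting.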

The results (Table 4) showed that Random Forest had the
best accuracy at 98.7% with a standard deviation of 0.084.
The model took 0.0312 seconds to run. The
precision score was 0.96, the recall (sensitivity) was 0.97,
and the F-score was 0.93. The specificity was 0.79, and
the ROC AUC was 0.96, showing the reliability of the classifier
in predicting the condition (ASD).

The result for the Decision Tree showed an accuracy
of 95.6% with a standard deviation of 0.079. The model
took 0.0562 seconds to run. The precision
score was 0.94, the recall (sensitivity) was 0.92, and the
F-score was 0.91. The specificity was 0.79, and the ROC AUC was
0.71.

The result from Naive Bayes showed that the accuracy
was 97.2% with a standard deviation of 0.082. The model
took 0.0725 seconds to run. The precision
score was 0.96, the recall (sensitivity) was 0.91, and the
F-score was 0.89. The specificity was 0.67, and the ROC AUC was
0.69.


From Table 5, the accuracy of 98.7% indicates that the
proposed model is relatively more successful in predicting ASD; this was
also supported by the domain knowledge and the
possibility that the disease is relatively common in the
general population of the sample data. The F-score is
lower than the accuracy measure because it combines precision and
recall in its computation.


Reliability Testing (Bayes Theorem) 

The strong classifier can be given additional
influence, which considerably improves classification
performance, by using the Bayesian formula to
dynamically update the weight value for each tree (Zhang
et al., 2021). By considering the incidence rate and
applying the Bayes theorem, we can assess how
effectively the machine learning classifier works and
determine whether the result is reliable enough to be used
in a typical clinical scenario. The specificity and
sensitivity of the result, that is, how frequently true
positives and true negatives are found by the test, are
analyzed using the Bayes theorem to determine its
validity. This makes it easier
to assess the value of binary classifiers.
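A minimal sketch of this reliability check is given below. It uses the figures the study states for its sample (prevalence 0.50, sensitivity and specificity 0.97); the function name and the 1% prevalence in the second call are hypothetical, added only to show how strongly the result depends on the incidence rate.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) from Bayes' theorem.

    P(A|B) = P(B|A) P(A) / P(B), where P(B) is expanded over the
    diseased and non-diseased groups.
    """
    true_positive = sensitivity * prevalence               # P(B|A) * P(A)
    false_positive = (1 - specificity) * (1 - prevalence)  # P(B|not A) * P(not A)
    return true_positive / (true_positive + false_positive)


# Figures stated in the study for its sample of 998 records.
study = positive_predictive_value(prevalence=0.50, sensitivity=0.97, specificity=0.97)

# Hypothetical lower prevalence (an assumption, not a figure from the study).
rare = positive_predictive_value(prevalence=0.01, sensitivity=0.97, specificity=0.97)

print(f"study sample: {study:.3f}")
print(f"hypothetical 1% prevalence: {rare:.3f}")
```

With a balanced sample the posterior equals the sensitivity, 0.97. Under the hypothetical 1% prevalence the same test would yield a posterior of roughly 0.25, which is why the incidence rate has to be considered before the result is applied to a different population.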

Event A = presence of the disease; P(A) is the unconditional
probability of this disease in
the population (population = 998; those with the disease =
50%, those without the disease = 50%, specificity/sensitivity
= 97%, 1 - specificity = 0.03).

P(A) = 0.50.

Event B = the test coming up positive; P(B) is the
unconditional probability of a positive test.


P(Bt) (True Positives) = 998 * 0.50 * 0.97 = 484.03

P(Bf) (False Positives) = 998 * 0.50 * 0.03 = 14.97

P(B) (total positives) = (484.03 + 14.97) / 998 = 0.5 =
50.0%

P(B|A) = probability of getting a positive result
given that the disease is present.

Thus P(B|A) = sensitivity.

P(B|A) = 0.97.

P(A|B) = 0.97 * 0.5 / 0.5 = 0.97 = 97%


Comparison with Similar Existing Work

The findings of this study, as shown in Table 5,
compare the result of Random Forest ASD prediction
with that of Vakadkar et al. (2021), which had an accuracy of
97.15%. The algorithm of Vakadkar et al. (2021) was
also implemented under the same environment, with
the data set reduced to 998 samples, and obtained an
accuracy of 97.7%.

Table 5. Proposed Model and Similar Existing Model on Autism Spectrum Disorder (ASD) Prediction.

Parameters     Proposed model   After comparison with a    Similar existing model
               (PPRF)           similar existing model     (Vakadkar et al., 2021)
                                (Vakadkar et al., 2021)
Accuracy       98.7%            97.7%                      97.15%
Time taken     0.0312 s         0.0553 s                   -
Specificity    0.79             0.71                       0.68
Precision      0.96             0.97                       0.90
Population     998              998                        1054
F-score        0.93             0.91                       0.89

In all the calculations, Random Forest
performed better than the other models because of its
ability to use different feature subsets and decide at
different classification stages. The outcomes
demonstrated that the model's ROC AUC was
approaching 1. Thus, the classifier is likely effective at
determining if a pregnant woman is carrying a child that
will likely be born with ASD.

The drawback of the model is that the total sample 
used for training the model was small when compared to 
the work of (Vakadkar et al., 2021). However, the study 
extends the scope of literature by subjecting results to the 
Bayes theorem to evaluate its overall reliability. 


Conclusion 
This section presents the summary, contribution to the
body of knowledge, conclusion, and recommendations.


Summary and Discussion 

The study developed an autism spectrum disorder
prediction model using machine learning tools and
algorithms. Predicting the disorder among
pregnant women was possible using PCA and machine
learning tools. The machine learning models of focus were
random forest, decision tree, and naive Bayes. The
findings showed that PCA and Random Forest could
predict the occurrence of ASD, and results from the
Random Forest showed an accuracy of 98.7%,
with a standard deviation of 0.084.


Contribution to the Body of Knowledge 

To forecast ASD, the study developed a supervised
model. This model was improved using an unsupervised
learning clustering technique, which selected the best
features and optimized the threshold while predicting a
class. The most significant features of the patients that
cause the disease were uncovered in the research. The
most strongly correlated aspects of the disease were
visualized using machine learning algorithms because
interpretability is a challenge for professionals who are
not data scientists. The study advances knowledge of
how critical it is for physicians to use machine learning
and feature-importance algorithms to detect ASD in
pregnant women at risk before birth. Although models
cannot capture the clinical impacts caused by bias or
incorrect calibration, the investigation's use of a
Random Forest yielded encouraging results. It was
found that the Bayes theorem validates the model's
reliability. This method considerably boosted the
advantages for patients and provides a threshold
probability-based evaluation viewpoint on predicting
ASD. This work contributes to the body of knowledge
regarding the use of PCA and machine learning
algorithms in the prediction of ASD.


Conclusion and Recommendations 

In this study, classification, clustering, and machine
learning techniques were proposed for the prediction of
ASD in pregnant women. Characteristics that can predict
ASD among pregnant
women at risk of having babies with ASD were analyzed
using PCA and machine learning methods. The resulting
machine learning model can be
utilized by doctors as a useful prediction tool for early
detection among pregnant women. Another conclusion
from this research is that Primipara, Age,
and Personal History do not improve the prediction
efficacy of the ASD model.

It is recommended that, to enhance autism spectrum
disorder management and prediction, healthcare
professionals collaborate extensively with hospital
data scientists to strengthen predictions. Ensemble
methods should be used to improve final algorithms,
since they can improve the results of autism spectrum
disorder prediction in routine prenatal care of the fetus.


Limitations and Suggestions for Further Study 

Future studies should incorporate both maternal and
paternal features to see whether additional pertinent
information can be acquired, as this study solely used
maternal factors. Secondly, because of the size of the data
used, the sample suggests a greater incidence than would
be seen in communities where disease incidence is lower,
so the findings might not be useful there. Future research
should therefore address these points, since studies with
a larger population and more attributes may open new
perspectives and help to solve the problem
of ASD prediction. Future research can utilize more
longitudinal survey data samples for that purpose.
Adopting deep learning models and further hyperparameter
tuning would allow this work to be expanded. Additional
sociodemographic data should be considered, as this was
not extensively explored in this research. Finally,
a smartphone or mobile application could be developed to
assist both medical professionals and patients in early
ASD detection.